A Comparison of Two Novel Algorithms for Clustering Web Documents
نویسندگان
چکیده
In this paper we investigate the clustering of web document collections using two variants of the popular kmeans clustering algorithm. The first variant is the global k-means method, which computes “good” initial cluster centers deterministically rather than relying on random initialization. The second variant allows for the use of graphs as fundamental representations of data items instead of the simpler vector model. We perform experiments comparing global k-means with random initialization using both the graph-based and the vectorbased representations. Experiments are carried out on two web document collections and performance is evaluated using two clustering performance measures.
منابع مشابه
A density based clustering approach to distinguish between web robot and human requests to a web server
Today world's dependence on the Internet and the emerging of Web 2.0 applications is significantly increasing the requirement of web robots crawling the sites to support services and technologies. Regardless of the advantages of robots, they may occupy the bandwidth and reduce the performance of web servers. Despite a variety of researches, there is no accurate method for classifying huge data ...
متن کاملخوشهبندی اسناد مبتنی بر آنتولوژی و رویکرد فازی
Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
متن کاملAssessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
متن کاملA Comparison of Three Document Clustering Algorithms: TreeCluster, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering
This work investigated three techniques to automatically cluster a collection of documents: Word-Intersection with GQF, Word-Intersection with hierarchical agglomerative clustering, and TreeClustering. The Word-Intersection algorithms have been previously described in the literature while the TreeClustering technique is novel to this work. The TreeCluster algorithm idea comes from rule inductio...
متن کاملClustering web documents using co-citation, coupling, incoming, and outgoing hyperlinks: a comparative performance analysis of algorithms
Querying search engines with the keyword ”jaguars” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines similarity of docu...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003